Colab by ironmanizawesome · Pull Request #4 · CandleLabAI/PCBSegClassNet

ironmanizawesome · 2026-05-05T18:36:43Z

No description provided.

PCBClassNet.build() was passing the (model, learning_layer1, learning_layer2) tuple straight into get_classification, which expects a single Keras Model. Unpack the tuple so the classification head receives the encoder model as intended; without this, the classification path cannot be built at all. Also adds CLAUDE.md (project guidance) and ignores the .claude/ working state plus training log files.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
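A minimal sketch of the bug and its fix. `FakeModel`, `build_encoder()` and this `get_classification()` are hypothetical stand-ins, not the repo's real code, so the example runs without TensorFlow; only the tuple-unpacking pattern is taken from the commit message.

```python
from dataclasses import dataclass


@dataclass
class FakeModel:
    """Hypothetical stand-in for a Keras Model."""
    name: str


def build_encoder():
    # Hypothetical: returns the (model, learning_layer1, learning_layer2)
    # tuple described in the commit message.
    return FakeModel("encoder"), "learning_layer1", "learning_layer2"


def get_classification(model):
    # Hypothetical: the classification head accepts a single model, not a tuple.
    if not isinstance(model, FakeModel):
        raise TypeError(f"expected a model, got {type(model).__name__}")
    return FakeModel(f"{model.name}+classification_head")


def build_buggy():
    # Bug: the whole tuple is forwarded to the head.
    return get_classification(build_encoder())


def build_fixed():
    # Fix: unpack first and hand only the encoder model to the head;
    # the two learning layers are ignored here.
    model, _, _ = build_encoder()
    return get_classification(model)
```

`build_buggy()` raises `TypeError`, while `build_fixed()` returns the head built on top of the encoder.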

Adds notebooks/colab_train.ipynb covering the full pipeline (clone, TF 2.10 pin, Drive mount, data unzip, seg + class training with checkpoint backup to Drive) so that an 8 GB local GPU is no longer a blocker. Pins TF 2.10.1 + keras 2.10 + protobuf 3.19.6 in the install cell, because Colab's bundled TF (2.15 with Keras 3) breaks `tf.keras.activations.softmax` calls and a few other patterns this codebase relies on. notebooks/README.md documents the data zip layout, the reason for TF 2.10, and a VRAM cheat sheet for the common Colab GPUs.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
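A small check that the install cell's pins actually took effect, sketched with the standard-library `importlib.metadata.version`. The helper name and the prefix-matching rule (so that `keras 2.10` accepts any 2.10.x release) are assumptions for illustration.

```python
from importlib.metadata import PackageNotFoundError, version

# Pins taken from the commit message.
EXPECTED_PINS = {
    "tensorflow": "2.10.1",
    "keras": "2.10",
    "protobuf": "3.19.6",
}


def pin_mismatches(pins, get_version=version):
    """Return {package: (expected, installed)} for every unsatisfied pin.

    A pin is satisfied when the installed version equals it or extends it
    with further dot-separated components (e.g. "2.10" accepts "2.10.0").
    `installed` is None when the package is not installed at all.
    """
    problems = {}
    for pkg, expected in pins.items():
        try:
            installed = get_version(pkg)
        except PackageNotFoundError:
            installed = None
        ok = installed is not None and (
            installed == expected or installed.startswith(expected + ".")
        )
        if not ok:
            problems[pkg] = (expected, installed)
    return problems
```

Run in the interpreter that holds the TF stack, `print(pin_mismatches(EXPECTED_PINS) or "all pins satisfied")` reports any drift.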

Colab's default Python is 3.12, which has no TF 2.10 wheels available (`pip install tensorflow==2.10.1` fails with "No matching distribution"). Insert a condacolab.install() step that swaps the kernel to a Python 3.10 base, then install the verified TF 2.10 stack on top. The kernel auto-restarts after condacolab.install(); the cloned repo on /content survives the restart, so subsequent cells just resume.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

Restructures the notebook so the entire data prep pipeline runs in Colab from the raw FPIC archive (~7 GB) instead of requiring a pre-zipped processed dataset (~18 GB):

- §4 unzips data_raw.zip (pcb_image + smd_annotation)
- §5 runs create_mask.py (GPU-accelerated EDSR upscaling)
- §6 runs create_patches.py (768 px patches + 80/20 train/val split)
- §§7-10: training/eval flow unchanged, with section numbers shifted

Caps full training at 40 epochs for both segmentation and classification. Colab Pro caps a single session at 24 h with a 90-min idle limit and no background execution, so Seg 100 + Class 100 (~30-37 h) cannot fit. Seg 40 + Class 40 fits comfortably in roughly 12 h on a T4.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
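A sketch of the patching and 80/20 split described for §6. It assumes non-overlapping 768 px tiles with edge remainders dropped and a seeded shuffle; the real create_patches.py may tile, pad, or split differently, and both function names are hypothetical.

```python
import random


def patch_origins(height, width, patch=768):
    """Top-left (y, x) corners of non-overlapping patch x patch tiles.

    Assumption: strips at the right/bottom edge narrower than one patch
    are dropped rather than padded.
    """
    return [
        (y, x)
        for y in range(0, height - patch + 1, patch)
        for x in range(0, width - patch + 1, patch)
    ]


def train_val_split(items, train_frac=0.8, seed=0):
    """Deterministically shuffle items, then split into (train, val)."""
    items = list(items)
    random.Random(seed).shuffle(items)
    cut = int(len(items) * train_frac)
    return items[:cut], items[cut:]
```

A 1536 x 2304 image yields six tiles (two rows, three columns), and ten items split into eight for training and two for validation.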

The latest condacolab defaults to Python 3.11, for which TF 2.10 also has no wheels (it only ships them for 3.7–3.10). Pass python_version="3.10" so the kernel restart lands on a Python 3.10 base that the TF 2.10 install can match.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
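The wheel-matrix constraint that drives these interpreter swaps can be expressed as a simple range check. The version ranges come from the commit messages in this PR (TF 2.10: Python 3.7–3.10; TF 2.15: Python 3.9–3.11); the table and function names are hypothetical.

```python
import sys

# Inclusive (min, max) Python versions with published wheels, per TF minor.
TF_WHEEL_PYTHONS = {
    "2.10": ((3, 7), (3, 10)),
    "2.15": ((3, 9), (3, 11)),
}


def has_tf_wheel(tf_minor, python=sys.version_info[:2]):
    """True if `pip install tensorflow==<tf_minor>.*` can find a wheel."""
    low, high = TF_WHEEL_PYTHONS[tf_minor]
    return low <= tuple(python) <= high
```

On a Python 3.12 kernel both checks fail, which is why the later commits move TF into a separate interpreter rather than the kernel.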

Drop the condacolab Python 3.10 workaround. Colab's default Python keeps moving past TF 2.10's wheel matrix (now 3.11/3.12), and the latest condacolab doesn't accept python_version on install_miniforge. TF 2.15 is the last TF release on Keras 2 (Keras 3 starts at TF 2.16) and ships wheels for the Python versions Colab actually serves, so the codebase's tf.keras.backend.{dot,transpose} usage keeps working with no source changes. Also rewrites the notebook from scratch to clean up problems that crept in during incremental NotebookEdit changes (duplicate ## 6 / ## 7 sections, both 100- and 40-epoch training cells, missing sanity cells).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
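The Keras 2/3 boundary this commit relies on, written as a version check. It assumes TF 2.x version strings and ignores the `TF_USE_LEGACY_KERAS` option that lets TF 2.16+ keep using Keras 2 through the separate `tf_keras` package; the function name is hypothetical.

```python
def bundled_keras_major(tf_version: str) -> int:
    """Keras major version that tf.keras resolves to for a TF 2.x release.

    TF 2.15 and earlier bundle Keras 2; TF 2.16 and later default to Keras 3.
    """
    major, minor = (int(part) for part in tf_version.split(".")[:2])
    return 2 if (major, minor) <= (2, 15) else 3
```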

Plain `pip install tensorflow==2.15.0` on Colab falls back to CPU because Colab's bundled CUDA libs are pinned to whatever TF version Colab ships, not 2.15. The `[and-cuda]` extra pulls in matching nvidia-cudnn-cu12 / cublas-cu12 / etc. wheels alongside TF; these are the libraries TF's GPU loader expects to dlopen. Without them, training falls back to CPU and create_mask.py / train_*.py take ~10× longer, with periodic "Cannot dlopen some GPU libraries" warnings in stderr.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

…oken) tensorflow[and-cuda]==2.15.0 fails to resolve because the extra pins tensorrt-libs==8.6.1, which has been removed from PyPI (only 9.x is still available). Drop the bracket extra and install nvidia-cudnn-cu12, nvidia-cublas-cu12, etc. by name in a separate pip call. TF needs these CUDA libraries at dlopen time but doesn't actually use TensorRT for training.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

Colab's notebook kernel runs on Python 3.12, but TF 2.15 only ships wheels for Python 3.9–3.11. Colab images already include /usr/local/bin/python3.11, so install the TF 2.15 stack into that interpreter and run create_mask.py / create_patches.py / train_*.py via !python3.11 instead of !python. The notebook kernel itself stays on Python 3.12: we never import tensorflow from kernel cells, and instead shell out to python3.11 for everything that touches TF.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
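A sketch of the same shell-out pattern from plain Python, equivalent in spirit to `!python3.11 create_mask.py` in a cell. The helper names are hypothetical, and script arguments are left to the caller because the commit does not list them.

```python
import shutil
import subprocess

TF_PYTHON = "python3.11"  # interpreter that holds the TF 2.15 stack


def tf_command(script, *args, interpreter=TF_PYTHON):
    """Build the argv for running a TF script outside the notebook kernel."""
    return [interpreter, script, *map(str, args)]


def run_tf_script(script, *args):
    """Run `script` under python3.11, failing loudly if it is missing."""
    if shutil.which(TF_PYTHON) is None:
        raise FileNotFoundError(f"{TF_PYTHON} not found on PATH")
    # check=True raises CalledProcessError if the script exits non-zero.
    return subprocess.run(tf_command(script, *args), check=True)
```

Keeping TensorFlow out of the kernel process means the kernel's own Python version never has to match TF's wheel matrix.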

…cell DISLoss's SSIM gradient backward path spikes a 416 MB tensor ([batch=16, 26 classes, 512, 512]) that fragments the allocator on 16 GB T4 GPUs and triggers OOM even though plenty of free memory exists. TF itself recommends `TF_GPU_ALLOCATOR=cuda_malloc_async` in this case. Add it as a prefix to every train/eval invocation so the setting actually takes effect; on a 24 GB L4 it's redundant but harmless.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
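The arithmetic behind the 416 MB figure, plus one way to pass the allocator setting to a subprocess. The size calculation assumes float32 (4 bytes per element), which reproduces the stated figure exactly (in MiB); `allocator_env` is a hypothetical helper, while `TF_GPU_ALLOCATOR=cuda_malloc_async` is the variable TF's own OOM message suggests.

```python
import math
import os


def tensor_mib(shape, bytes_per_element=4):
    """Size of a dense tensor in MiB (float32 by default)."""
    return math.prod(shape) * bytes_per_element / 2**20


def allocator_env(base=None):
    """Copy of the environment with TF's async CUDA allocator enabled."""
    env = dict(os.environ if base is None else base)
    env["TF_GPU_ALLOCATOR"] = "cuda_malloc_async"
    return env


# [batch=16, 26 classes, 512, 512] float32 -> 436,207,616 bytes = 416 MiB
SSIM_GRAD_MIB = tensor_mib((16, 26, 512, 512))
```

The returned dict can be passed as `subprocess.run([...], env=allocator_env())`, which has the same effect as the `TF_GPU_ALLOCATOR=cuda_malloc_async` prefix in a shell cell.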

L4 Colab images don't ship python3.11 (T4 images do). Add a guard that installs python3.11 from the deadsnakes PPA when it's missing. Pin every nvidia-*-cu12 wheel to the version TF 2.15 expects to dlopen:

- nvidia-cudnn-cu12==8.9.4.25 (latest is 9.x; TF 2.15 needs libcudnn.so.8)
- nvidia-cublas-cu12==12.2.5.6, etc.

Without these pins, TF 2.15 falls back to CPU on a fresh runtime because it can't find the right .so versions, and the warnings are easy to miss.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
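A sketch of the guard and the pinned install as a list of commands to run. Only the two pins spelled out in the commit are included (the remaining nvidia-*-cu12 pins are elided there); the `apt-get update` step and the helper name are assumptions, and bootstrapping pip for a freshly installed deadsnakes interpreter is not covered.

```python
import shutil

# Only the pins named in the commit message; the rest ("etc.") are not listed.
CUDA_PINS = [
    "nvidia-cudnn-cu12==8.9.4.25",  # TF 2.15 dlopens libcudnn.so.8
    "nvidia-cublas-cu12==12.2.5.6",
]


def setup_commands(python="python3.11", which=shutil.which):
    """Commands to ensure `python` exists, then install the pinned CUDA wheels."""
    cmds = []
    if which(python) is None:
        # Guard: L4 images lack python3.11, so pull it from deadsnakes.
        cmds.append(["add-apt-repository", "-y", "ppa:deadsnakes/ppa"])
        cmds.append(["apt-get", "update"])
        cmds.append(["apt-get", "install", "-y", python])
    # Separate pip call, installing the CUDA libraries by name.
    cmds.append([python, "-m", "pip", "install", *CUDA_PINS])
    return cmds
```

Injecting `which` keeps the guard testable without touching the real system; each returned list can be handed to `subprocess.run(..., check=True)`.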

Cutting the paper's 100 epochs down to 40 was too aggressive. 80 is the sweet spot: it leaves enough room for ReduceLROnPlateau (patience=15) to fire and fine-tune, while still fitting inside Colab Pro's 24 h session limit (~9 h per model on an L4 = 18 h total, plus a preprocessing buffer).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
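The session-budget arithmetic as a tiny check. The 24 h cap and the ~9 h per model come from the commit message; the function name is hypothetical, and the preprocessing buffer is a parameter because the commit does not quantify it.

```python
SESSION_LIMIT_H = 24  # Colab Pro single-session cap, per the commit message


def session_budget(hours_per_model, n_models=2, buffer_h=0.0,
                   limit_h=SESSION_LIMIT_H):
    """Return (fits, total_hours) for sequential training runs plus a buffer."""
    total = hours_per_model * n_models + buffer_h
    return total <= limit_h, total
```

`session_budget(9)` gives `(True, 18.0)`, leaving about 6 h for preprocessing; at the paper's 100 epochs (at least ~15 h per model, going by the ~30-37 h estimate for Seg 100 + Class 100) the same check fails.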

The first 80-epoch segmentation run reaches val_dice of about 0.71 against a train dice of 0.92 (clear overfitting), with lr already at min_lr=1e-5. Add an optional second stage that resumes from best_seg.h5 with a lower lr range, so ReduceLROnPlateau can keep stepping down past 1e-5.

Changes:

- train_segmentation.py: -resume CLI flag; when set, model.load_weights is called on the configured checkpoint path before fit().
- src/cfs/pscn_seg_finetune.yml: same architecture as pscn_seg.yml but lr=1e-5 (where the first run left off) and min_lr=1e-6.
- notebooks/colab_train.ipynb: new §8b that restores best_seg.h5 from Drive if missing, runs 20 epochs with -resume + the finetune config, then re-mirrors the best checkpoint.
- .gitignore: ignore /best_*.h5 and root-level *.zip (Colab artifacts that landed in the working tree).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
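A sketch of the `-resume` flow: an argparse flag plus a helper that calls `model.load_weights` on the checkpoint before training. The `-config` flag, its default path, and `maybe_resume` are hypothetical; the real train_segmentation.py may wire the config and checkpoint path differently.

```python
import argparse
import os


def parse_args(argv=None):
    parser = argparse.ArgumentParser(description="Segmentation training (sketch)")
    # Hypothetical flag: how the real script receives its config is not shown.
    parser.add_argument("-config", default="src/cfs/pscn_seg.yml")
    parser.add_argument("-resume", action="store_true",
                        help="load weights from the checkpoint before fit()")
    return parser.parse_args(argv)


def maybe_resume(model, checkpoint_path, resume):
    """Load checkpoint weights into `model` when resuming; no-op otherwise."""
    if resume:
        if not os.path.exists(checkpoint_path):
            raise FileNotFoundError(checkpoint_path)
        model.load_weights(checkpoint_path)
    return model
```

Failing fast on a missing checkpoint matters here, because silently training the fine-tune stage from random weights at lr=1e-5 would waste the whole session.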

Brings the fine-tune second stage into the main Colab branch so users running the notebook always have §8b available without switching branches.

TF 2.15 lazy-loads tf.keras, and accessing `tf.keras.__version__` raises an AttributeError mid-cell, which swallows the GPU print that follows. Print only the TF version and GPU list; users who specifically need the Keras version can check it in a separate cell.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
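A sketch of the trimmed sanity check, meant to run under python3.11 (for example saved to a file and launched with `!python3.11`), since the notebook kernel never imports TensorFlow. The function name is hypothetical; `tf.config.list_physical_devices` is the standard TF API for listing GPUs.

```python
import importlib.util


def tf_sanity_report():
    """TF version plus visible GPUs, or a note if TF isn't installed here.

    Deliberately does not touch tf.keras.__version__, which the commit
    reports raising AttributeError under TF 2.15's lazy tf.keras loading.
    """
    if importlib.util.find_spec("tensorflow") is None:
        return "tensorflow is not installed for this interpreter"
    import tensorflow as tf

    gpus = tf.config.list_physical_devices("GPU")
    return f"TF {tf.__version__}; GPUs: {[gpu.name for gpu in gpus]}"


if __name__ == "__main__":
    print(tf_sanity_report())
```

An empty GPU list on a GPU runtime is the quiet symptom of the CPU fallback described in the earlier commits.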

ironmanizawesome and others added 2 commits May 6, 2026 02:18

ironmanizawesome marked this pull request as draft May 5, 2026 18:37

ironmanizawesome and others added 13 commits May 6, 2026 04:03

Merge branch 'seg-finetune' into colab

6d83ed4


ironmanizawesome wants to merge 15 commits into CandleLabAI:main from ironmanizawesome:colab
